Posted to common-user@hadoop.apache.org by Naama Kraus <na...@gmail.com> on 2008/03/10 09:57:35 UTC

File size and number of files considerations

Hi,

In our system, we plan to upload data into Hadoop from external sources and
use it later on for analysis tasks. The interface to the external
repositories allows us to fetch pieces of data in chunks, e.g. get n records
at a time. Records are relatively small, though the overall amount of data
is assumed to be large. For each repository, we fetch pieces of data in a
serial manner. The number of repositories is small (just a few).

My first step is to put the data in plain files in HDFS. My question is what
file sizes are optimal. Many small files (to the extreme of one record per
file)? I guess not. A few huge files, each holding all the data of the same
type? Or maybe put each chunk we get in a separate file, and close it right
after the chunk is uploaded?

How would HDFS perform best, with a few large files or with more, smaller
files? As I wrote, we plan to run MapReduce jobs over the data in the files
in order to organize and analyze it.

Thanks for any help,
Naama

-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)

Re: File size and number of files considerations

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.

Naama Kraus wrote:
> Hi,
> 
> Thanks all for the input.
> 
> Here are my further questions:
> 
> I can consolidate the data off-line to have big enough files (>64M), or
> copy the smaller files to DFS and then consolidate them using MapReduce.

You can also let MapReduce read your original small local files and write
them into large HDFS files, consolidating them, so to speak, on the fly.
Something like distcp (see the related issues in Jira), but with custom
processing of your inputs.
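
As a concrete illustration of the consolidate-on-upload idea (not code from
this thread; the class name, paths, and buffer size are made up for the
sketch), something along these lines with the plain FileSystem client API
merges a directory of small local chunk files into one large HDFS file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: merge many small local chunk files into a single large HDFS file
// while uploading, instead of creating one HDFS file per chunk.
public class ChunkUploader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);        // destination file system
    FileSystem local = FileSystem.getLocal(conf);  // source of the chunk files

    Path srcDir = new Path(args[0]);               // local directory of chunks
    Path dstFile = new Path(args[1]);              // one large HDFS file

    FSDataOutputStream out = hdfs.create(dstFile);
    try {
      // Assumes srcDir contains only the plain chunk files.
      for (FileStatus chunk : local.listStatus(srcDir)) {
        IOUtils.copyBytes(local.open(chunk.getPath()), out, 4096, false);
      }
    } finally {
      out.close();
    }
  }
}

Whether you merge with a plain client like this or with a distcp-like
MapReduce job mostly depends on how much custom processing of the inputs you
need on the way in.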

> 1. If I choose the first option, would the copy of a 64M file into DFS from
> a local file system perform well?
> 
> 2. If I choose the second option, how would one suggest implementing it? I
> am not sure how I control the size of the reduce output files.
> 
> 3. I had the impression that DFS splits large files and distributes the
> splits around. Is that true?

Just wanted to mention that splits are logical, not physical. It is not as if
HDFS cuts files into pieces and moves them around. You can think of splits as
file ranges.
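
For what it's worth, a small sketch of that (against the old
org.apache.hadoop.mapred API; the class name and the split-count hint are
invented for the example) just prints the splits an input format would hand
to a job. Each FileSplit is only a (path, start offset, length) triple; no
data is copied or moved to compute it:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: show that input splits are just file ranges, not physical pieces.
public class PrintSplits {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PrintSplits.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));

    TextInputFormat format = new TextInputFormat();
    format.configure(conf);

    // The second argument is only a hint for the number of splits; block
    // boundaries of the underlying files still drive the result.
    for (InputSplit split : format.getSplits(conf, 1)) {
      FileSplit range = (FileSplit) split;
      System.out.println(range.getPath() + " start=" + range.getStart()
          + " length=" + range.getLength());
    }
  }
}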

> If so, why should I mind if my files are extremely
> large? Say gigabytes or even terabytes? Doesn't DFS take care of it
> internally and thus scale up in terms of file size? I am quoting from the
> HDFS architecture document at
> http://hadoop.apache.org/core/docs/current/hdfs_design.html#Large+Data+Sets
> "Applications that run on HDFS have large data sets. A typical file in HDFS
> is gigabytes to terabytes in size. Thus, HDFS is tuned to support large
> files."
> 
> 4. Is there further recommended material to read about these issues?
> 
> Thanks, Naama
> 
> On Mon, Mar 10, 2008 at 6:43 PM, Amar Kamat <am...@yahoo-inc.com> wrote:
> 
> 
>>By chunks I meant the basic unit of processing, i.e. a DFS block. Sorry for
>>the confusion, I should have stated it clearly. What I meant was that for
>>files smaller than the default block size, the file itself becomes the basic
>>unit of computation. One can have a very huge file and rely on the DFS block
>>size, but a simpler approach would be to create the smaller files up front
>>(if possible). This avoids playing around with the block size and causes
>>less confusion in terms of record boundaries etc. I don't have specific
>>values for the file sizes, but very small files will cause lots of maps,
>>which in turn slows down the reducers. So make sure your files form a
>>logical unit of computation and are of a good enough size.
>>Thanks Ted for pointing it out.
>>Amar
>>
>>>On Mon, 10 Mar 2008, Ted Dunning wrote:
>>>
>>>Amar's comments are a little strange.
>>>
>>>Replication occurs at the block level, not the file level.  Storing data in
>>>a small number of large files or a large number of small files will have
>>>less than a factor of two effect on number of replicated blocks if the
>>>small files are >64MB.  Files smaller than that will hurt performance due
>>>to seek costs.
>>>
>>>To address Naama's question, you should consolidate your files so that you
>>>have files of at least 64 MB and preferably a bit larger than that.  This
>>>helps because it allows the reading of the files to proceed in a nice
>>>sequential manner which can greatly increase throughput.
>>>
>>>If consolidating these files off-line is difficult, it is easy to do in a
>>>preliminary map-reduce step.  This will incur a one-time cost, but if you
>>>are doing multiple passes over the data later, it will be worth it.
>>>
>>>
>>>On 3/10/08 3:12 AM, "Amar Kamat" <am...@yahoo-inc.com> wrote:
>>>
>>>>On Mon, 10 Mar 2008, Naama Kraus wrote:
>>>>
>>>>>Hi,
>>>>>
>>>>>In our system, we plan to upload data into Hadoop from external sources
>>>>>and use it later on for analysis tasks. The interface to the external
>>>>>repositories allows us to fetch pieces of data in chunks, e.g. get n
>>>>>records at a time. Records are relatively small, though the overall
>>>>>amount of data is assumed to be large. For each repository, we fetch
>>>>>pieces of data in a serial manner. The number of repositories is small
>>>>>(just a few).
>>>>>
>>>>>My first step is to put the data in plain files in HDFS. My question is
>>>>>what file sizes are optimal. Many small files (to the extreme of one
>>>>>record per file)? I guess not. A few huge files, each holding all the
>>>>>data of the same type? Or maybe put each chunk we get in a separate
>>>>>file, and close it right after the chunk is uploaded?
>>>>>
>>>>I think it should be based more on the size of the data you want to
>>>>process in a map, which here, I think, is the chunk size, no?
>>>>The larger the file, the fewer the replicas, and hence the more network
>>>>transfers when there are more maps. With smaller files the NN will be the
>>>>bottleneck, but you will end up having more replicas for each map task
>>>>and hence more locality.
>>>>Amar
>>>>
>>>>>How would HDFS perform best, with a few large files or with more,
>>>>>smaller files? As I wrote, we plan to run MapReduce jobs over the data
>>>>>in the files in order to organize and analyze it.
>>>>>
>>>>>Thanks for any help,
>>>>>Naama
>>>>>
>>>>>
>>>
>>>
> 
> 
> 

Re: File size and number of files considerations

Posted by Naama Kraus <na...@gmail.com>.
Hi,

Thanks all for the input.

Here are my further questions:

I can consolidate the data off-line to have big enough files (>64M), or copy
the smaller files to DFS and then consolidate them using MapReduce.

1. If I choose the first option, would the copy of a 64M file into DFS from
a local file system perform well?

2. If I choose the second option, how would one suggest implementing it? I
am not sure how I control the size of the reduce output files.

3. I had the impression that DFS splits large files and distributes the
splits around. Is that true? If so, why should I mind if my files are
extremely large? Say gigabytes or even terabytes? Doesn't DFS take care of
it internally and thus scale up in terms of file size? I am quoting from
the HDFS architecture document at
http://hadoop.apache.org/core/docs/current/hdfs_design.html#Large+Data+Sets
"Applications that run on HDFS have large data sets. A typical file in HDFS
is gigabytes to terabytes in size. Thus, HDFS is tuned to support large
files."

4. Is there further recommended material to read about these issues?

Thanks, Naama

On Mon, Mar 10, 2008 at 6:43 PM, Amar Kamat <am...@yahoo-inc.com> wrote:

> By chunks I meant the basic unit of processing, i.e. a DFS block. Sorry for
> the confusion, I should have stated it clearly. What I meant was that for
> files smaller than the default block size, the file itself becomes the
> basic unit of computation. One can have a very huge file and rely on the
> DFS block size, but a simpler approach would be to create the smaller files
> up front (if possible). This avoids playing around with the block size and
> causes less confusion in terms of record boundaries etc. I don't have
> specific values for the file sizes, but very small files will cause lots of
> maps, which in turn slows down the reducers. So make sure your files form a
> logical unit of computation and are of a good enough size.
> Thanks Ted for pointing it out.
> Amar
> >On Mon, 10 Mar 2008, Ted Dunning wrote:
> >
> > Amar's comments are a little strange.
> >
> > Replication occurs at the block level, not the file level.  Storing data
> > in a small number of large files or a large number of small files will
> > have less than a factor of two effect on number of replicated blocks if
> > the small files are >64MB.  Files smaller than that will hurt performance
> > due to seek costs.
> >
> > To address Naama's question, you should consolidate your files so that
> > you have files of at least 64 MB and preferably a bit larger than that.
> > This helps because it allows the reading of the files to proceed in a
> > nice sequential manner which can greatly increase throughput.
> >
> > If consolidating these files off-line is difficult, it is easy to do in a
> > preliminary map-reduce step.  This will incur a one-time cost, but if you
> > are doing multiple passes over the data later, it will be worth it.
> >
> >
> > On 3/10/08 3:12 AM, "Amar Kamat" <am...@yahoo-inc.com> wrote:
> >
> >> On Mon, 10 Mar 2008, Naama Kraus wrote:
> >>
> >>> Hi,
> >>>
> >>> In our system, we plan to upload data into Hadoop from external sources
> >>> and use it later on for analysis tasks. The interface to the external
> >>> repositories allows us to fetch pieces of data in chunks, e.g. get n
> >>> records at a time. Records are relatively small, though the overall
> >>> amount of data is assumed to be large. For each repository, we fetch
> >>> pieces of data in a serial manner. The number of repositories is small
> >>> (just a few).
> >>>
> >>> My first step is to put the data in plain files in HDFS. My question is
> >>> what file sizes are optimal. Many small files (to the extreme of one
> >>> record per file)? I guess not. A few huge files, each holding all the
> >>> data of the same type? Or maybe put each chunk we get in a separate
> >>> file, and close it right after the chunk is uploaded?
> >>>
> >> I think it should be based more on the size of the data you want to
> >> process in a map, which here, I think, is the chunk size, no?
> >> The larger the file, the fewer the replicas, and hence the more network
> >> transfers when there are more maps. With smaller files the NN will be
> >> the bottleneck, but you will end up having more replicas for each map
> >> task and hence more locality.
> >> Amar
> >>> How would HDFS perform best, with a few large files or with more,
> >>> smaller files? As I wrote, we plan to run MapReduce jobs over the data
> >>> in the files in order to organize and analyze it.
> >>>
> >>> Thanks for any help,
> >>> Naama
> >>>
> >>>
> >
> >
>



-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)

Re: File size and number of files considerations

Posted by Amar Kamat <am...@yahoo-inc.com>.
By chunks I meant the basic unit of processing, i.e. a DFS block. Sorry for
the confusion, I should have stated it clearly. What I meant was that for
files smaller than the default block size, the file itself becomes the basic
unit of computation. One can have a very huge file and rely on the DFS block
size, but a simpler approach would be to create the smaller files up front
(if possible). This avoids playing around with the block size and causes
less confusion in terms of record boundaries etc. I don't have specific
values for the file sizes, but very small files will cause lots of maps,
which in turn slows down the reducers. So make sure your files form a
logical unit of computation and are of a good enough size.
Thanks Ted for pointing it out.
Amar
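
One note on the "rely on the DFS block size" route: the block size is not
necessarily the cluster-wide default; it can also be set per file when the
file is created. A hedged sketch (the 128 MB figure and the path are only
examples, not values from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: create one big HDFS file with a per-file block size larger than
// the cluster default, as an alternative to many block-sized input files.
public class CreateWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long blockSize = 128L * 1024 * 1024;   // 128 MB, example value only
    short replication = fs.getDefaultReplication();
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);

    FSDataOutputStream out = fs.create(new Path("/data/repo1/all-records"),
        true,          // overwrite if it already exists
        bufferSize,
        replication,
        blockSize);
    // ... write the fetched records here ...
    out.close();
  }
}
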
>On Mon, 10 Mar 2008, Ted Dunning wrote:
>
> Amar's comments are a little strange.
>
> Replication occurs at the block level, not the file level.  Storing data in
> a small number of large files or a large number of small files will have
> less than a factor of two effect on number of replicated blocks if the small
> files are >64MB.  Files smaller than that will hurt performance due to seek
> costs.
>
> To address Naama's question, you should consolidate your files so that you
> have files of at least 64 MB and preferably a bit larger than that.  This
> helps because it allows the reading of the files to proceed in a nice
> sequential manner which can greatly increase throughput.
>
> If consolidating these files off-line is difficult, it is easy to do in a
> preliminary map-reduce step.  This will incur a one-time cost, but if you
> are doing multiple passes over the data later, it will be worth it.
>
>
> On 3/10/08 3:12 AM, "Amar Kamat" <am...@yahoo-inc.com> wrote:
>
>> On Mon, 10 Mar 2008, Naama Kraus wrote:
>>
>>> Hi,
>>>
>>> In our system, we plan to upload data into Hadoop from external sources
>>> and use it later on for analysis tasks. The interface to the external
>>> repositories allows us to fetch pieces of data in chunks, e.g. get n
>>> records at a time. Records are relatively small, though the overall
>>> amount of data is assumed to be large. For each repository, we fetch
>>> pieces of data in a serial manner. The number of repositories is small
>>> (just a few).
>>>
>>> My first step is to put the data in plain files in HDFS. My question is
>>> what file sizes are optimal. Many small files (to the extreme of one
>>> record per file)? I guess not. A few huge files, each holding all the
>>> data of the same type? Or maybe put each chunk we get in a separate
>>> file, and close it right after the chunk is uploaded?
>>>
>> I think it should be based more on the size of the data you want to
>> process in a map, which here, I think, is the chunk size, no?
>> The larger the file, the fewer the replicas, and hence the more network
>> transfers when there are more maps. With smaller files the NN will be the
>> bottleneck, but you will end up having more replicas for each map task
>> and hence more locality.
>> Amar
>>> How would HDFS perform best, with a few large files or with more,
>>> smaller files? As I wrote, we plan to run MapReduce jobs over the data
>>> in the files in order to organize and analyze it.
>>>
>>> Thanks for any help,
>>> Naama
>>>
>>>
>
>

Re: File size and number of files considerations

Posted by Ted Dunning <td...@veoh.com>.
Amar's comments are a little strange.

Replication occurs at the block level, not the file level.  Storing data in
a small number of large files or a large number of small files will have
less than a factor of two effect on number of replicated blocks if the small
files are >64MB.  Files smaller than that will hurt performance due to seek
costs.

To address Naama's question, you should consolidate your files so that you
have files of at least 64 MB and preferably a bit larger than that.  This
helps because it allows the reading of the files to proceed in a nice
sequential manner which can greatly increase throughput.

If consolidating these files off-line is difficult, it is easy to do in a
preliminary map-reduce step.  This will incur a one-time cost, but if you
are doing multiple passes over the data later, it will be worth it.
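
For anyone looking for a starting point, here is a hedged sketch of such a
preliminary consolidation job against the old org.apache.hadoop.mapred API
(the class names, paths, and sizes are invented for the example; it assumes
plain text records, one per line, and that TextOutputFormat drops
NullWritable values so each output record is just the line). The key point
is that the number of reduce tasks caps the number of output files, so
choosing it as roughly total input size divided by the desired output file
size is what controls how large the consolidated files come out:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch: consolidate many small text files into a handful of large ones.
public class Consolidate {

  // Re-emit each line as the key and drop the byte-offset key that
  // TextInputFormat supplies.
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      out.collect(line, NullWritable.get());
    }
  }

  // Write every occurrence of every line back out. Note that the shuffle
  // sorts and regroups the lines, so the original record order is not kept.
  public static class LineReducer extends MapReduceBase
      implements Reducer<Text, NullWritable, Text, NullWritable> {
    public void reduce(Text line, Iterator<NullWritable> values,
        OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        values.next();
        out.collect(line, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(Consolidate.class);
    conf.setJobName("consolidate");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(LineMapper.class);
    conf.setReducerClass(LineReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(NullWritable.class);
    // The number of reducers bounds the number of output files, e.g.
    // ~10 GB of input / ~128 MB per file -> about 80 reducers.
    conf.setNumReduceTasks(Integer.parseInt(args[2]));
    JobClient.runJob(conf);
  }
}

The reducer count set here is also the knob Naama asked about for
controlling the size of the reduce output files.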


On 3/10/08 3:12 AM, "Amar Kamat" <am...@yahoo-inc.com> wrote:

> On Mon, 10 Mar 2008, Naama Kraus wrote:
> 
>> Hi,
>> 
>> In our system, we plan to upload data into Hadoop from external sources
>> and use it later on for analysis tasks. The interface to the external
>> repositories allows us to fetch pieces of data in chunks, e.g. get n
>> records at a time. Records are relatively small, though the overall
>> amount of data is assumed to be large. For each repository, we fetch
>> pieces of data in a serial manner. The number of repositories is small
>> (just a few).
>> 
>> My first step is to put the data in plain files in HDFS. My question is
>> what file sizes are optimal. Many small files (to the extreme of one
>> record per file)? I guess not. A few huge files, each holding all the
>> data of the same type? Or maybe put each chunk we get in a separate
>> file, and close it right after the chunk is uploaded?
>> 
> I think it should be based more on the size of the data you want to
> process in a map, which here, I think, is the chunk size, no?
> The larger the file, the fewer the replicas, and hence the more network
> transfers when there are more maps. With smaller files the NN will be the
> bottleneck, but you will end up having more replicas for each map task and
> hence more locality.
> Amar
>> How would HDFS perform best, with a few large files or with more, smaller
>> files? As I wrote, we plan to run MapReduce jobs over the data in the
>> files in order to organize and analyze it.
>> 
>> Thanks for any help,
>> Naama
>> 
>> 


Re: File size and number of files considerations

Posted by Amar Kamat <am...@yahoo-inc.com>.
On Mon, 10 Mar 2008, Naama Kraus wrote:

> Hi,
>
> In our system, we plan to upload data into Hadoop from external sources and
> use it later on for analysis tasks. The interface to the external
> repositories allows us to fetch pieces of data in chunks, e.g. get n
> records at a time. Records are relatively small, though the overall amount
> of data is assumed to be large. For each repository, we fetch pieces of
> data in a serial manner. The number of repositories is small (just a few).
>
> My first step is to put the data in plain files in HDFS. My question is
> what file sizes are optimal. Many small files (to the extreme of one record
> per file)? I guess not. A few huge files, each holding all the data of the
> same type? Or maybe put each chunk we get in a separate file, and close it
> right after the chunk is uploaded?
>
I think it should be based more on the size of the data you want to
process in a map, which here, I think, is the chunk size, no?
The larger the file, the fewer the replicas, and hence the more network
transfers when there are more maps. With smaller files the NN will be the
bottleneck, but you will end up having more replicas for each map task and
hence more locality.
Amar
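
To put rough numbers on this trade-off (the sizes below are invented purely
for illustration, assuming the default 64 MB block size and 3-way
replication): the same 10 GB stored as 1 MB files versus block-sized files
looks like this.

// Back-of-the-envelope sketch only; the sizes are made-up example values.
public class LayoutMath {
  public static void main(String[] args) {
    long totalBytes = 10L * 1024 * 1024 * 1024;  // 10 GB of records
    long blockSize = 64L * 1024 * 1024;          // default dfs.block.size
    int replication = 3;                         // default dfs.replication

    // Layout A: many 1 MB files. Each file occupies its own (under-full)
    // block, and each file and block is an object in the NameNode's memory.
    long smallFiles = totalBytes / (1024 * 1024);            // 10240 files
    System.out.println("1 MB files : " + smallFiles + " files/blocks, "
        + (smallFiles * replication) + " block replicas, ~" + smallFiles
        + " map tasks");

    // Layout B: a few large files made of full 64 MB blocks.
    long blocks = totalBytes / blockSize;                    // 160 blocks
    System.out.println("large files: " + blocks + " blocks, "
        + (blocks * replication) + " block replicas, ~" + blocks
        + " map tasks");
  }
}

The total replicated bytes are the same either way (as Ted notes in his
reply, replication happens per block), but the number of NameNode objects
and map tasks differs by a factor of 64 in this example.
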
> How would HDFS perform best, with a few large files or with more, smaller
> files? As I wrote, we plan to run MapReduce jobs over the data in the files
> in order to organize and analyze it.
>
> Thanks for any help,
> Naama
>
>