Posted to common-user@hadoop.apache.org by Michael Kintzer <mi...@zerk.com> on 2010/02/25 18:45:40 UTC

cluster involvement trigger

Hi,

We are using the streaming API.    We are trying to understand what hadoop uses as a threshold or trigger to involve more TaskTracker nodes in a given Map-Reduce execution.

With default settings (64MB chunk size in HDFS), if the input file is less than 64MB, will the data processing only occur on a single TaskTracker Node, even if our cluster size is greater than 1?

For example, we are trying to figure out if hadoop is more efficient at processing:
a) a single input file which is just an index file that refers to a jar archive of 100K or 1M individual small files, where the jar file is passed as the "-archives" argument, or 
b) a single input file containing all the raw data represented by the 100K or 1M small files.

With (a), our input file is <64MB.   With (b) our input file is very large.
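
For concreteness, the two setups would be invoked roughly like the following (the streaming jar location, paths, and script names are placeholders, not our actual job):

    # (a) small index file as the job input; the small files travel as a jar via -archives
    hadoop jar hadoop-streaming.jar \
        -archives hdfs:///user/us/smallfiles.jar#smallfiles \
        -input /user/us/index.txt -output /user/us/out-a \
        -mapper map.sh -reducer reduce.sh

    # (b) one large input file containing all the raw data
    hadoop jar hadoop-streaming.jar \
        -input /user/us/alldata.txt -output /user/us/out-b \
        -mapper map.sh -reducer reduce.sh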

Thanks for any insight,

-Michael

Re: cluster involvement trigger

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
You mentioned that you pass the files packed together using the -archives option. The archive is unpacked on the compute nodes themselves, so the namenode won't be hampered in this case. However, cleanup of the distributed cache is a tricky scenario (the user has no explicit control over it); you may search this list for the many discussions on that topic. And while on the subject of archives, though it may not be practical for you right now, Hadoop Archives (har) provide similar functionality.
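
For example, a har could be created and then addressed through the har:// filesystem roughly as follows (the archive name and paths are illustrative only):

    # pack everything under /user/us/smallfiles into a single archive
    hadoop archive -archiveName data.har -p /user/us smallfiles /user/us/archives

    # the archived files are then visible through the har scheme
    hadoop fs -ls har:///user/us/archives/data.har/smallfiles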
Hope this helps.

Amogh


On 2/27/10 12:53 AM, "Michael Kintzer" <mi...@zerk.com> wrote:

Amogh,

Thank you for the detailed information.   Our initial prototyping seems to agree with your statements below, i.e. a single large input file is performing better than an index file + an archive of small files.   I will take a look at the CombineFileInputFormat as you suggested.

One question.   Since the many small input files are all in a single jar archive managed by the name node, does that still hamper name node performance?   I was under the impression these archives are only unpacked into the temporary map-reduce file space (and, I'm assuming, cleaned up after the map-reduce job completes).   Does the name node need to store the metadata of each individual file during the unpacking in this case?

-Michael

On Feb 25, 2010, at 10:31 PM, Amogh Vasekar wrote:

> Hi,
> The number of mappers initialized depends largely on your input format (specifically on its getSplits method). Since (almost all) input formats available in Hadoop derive from FileInputFormat, you get the familiar notion of one mapper per file block (strictly, one mapper per split).
> You say that you have a large number of small files. In general, each of these small files (< 64 MB) will be processed by its own mapper. I would suggest looking at CombineFileInputFormat, which packs many small files into a single split based on data locality for better performance (task initialization time is a significant factor in Hadoop's overall performance).
> On the other hand, many small files will hamper your namenode's performance, since file metadata is stored in memory, and will limit the cluster's overall capacity with respect to the number of files.
>
> Amogh
>
>
> On 2/25/10 11:15 PM, "Michael Kintzer" <mi...@zerk.com> wrote:
>
> Hi,
>
> We are using the streaming API.    We are trying to understand what hadoop uses as a threshold or trigger to involve more TaskTracker nodes in a given Map-Reduce execution.
>
> With default settings (64MB chunk size in HDFS), if the input file is less than 64MB, will the data processing only occur on a single TaskTracker Node, even if our cluster size is greater than 1?
>
> For example, we are trying to figure out if hadoop is more efficient at processing:
> a) a single input file which is just an index file that refers to a jar archive of 100K or 1M individual small files, where the jar file is passed as the "-archives" argument, or
> b) a single input file containing all the raw data represented by the 100K or 1M small files.
>
> With (a), our input file is <64MB.   With (b) our input file is very large.
>
> Thanks for any insight,
>
> -Michael
>



Re: cluster involvement trigger

Posted by Michael Kintzer <mi...@zerk.com>.
Amogh,

Thank you for the detailed information.   Our initial prototyping seems to agree with your statements below, i.e. a single large input file is performing better than an index file + an archive of small files.   I will take a look at the CombineFileInputFormat as you suggested.     

One question.   Since the many small input files are all in a single jar archive managed by the name node, does that still hamper name node performance?   I was under the impression these archives are only unpacked into the temporary map-reduce file space (and, I'm assuming, cleaned up after the map-reduce job completes).   Does the name node need to store the metadata of each individual file during the unpacking in this case?

-Michael

On Feb 25, 2010, at 10:31 PM, Amogh Vasekar wrote:

> Hi,
> The number of mappers initialized depends largely on your input format (specifically on its getSplits method). Since (almost all) input formats available in Hadoop derive from FileInputFormat, you get the familiar notion of one mapper per file block (strictly, one mapper per split).
> You say that you have a large number of small files. In general, each of these small files (< 64 MB) will be processed by its own mapper. I would suggest looking at CombineFileInputFormat, which packs many small files into a single split based on data locality for better performance (task initialization time is a significant factor in Hadoop's overall performance).
> On the other hand, many small files will hamper your namenode's performance, since file metadata is stored in memory, and will limit the cluster's overall capacity with respect to the number of files.
> 
> Amogh
> 
> 
> On 2/25/10 11:15 PM, "Michael Kintzer" <mi...@zerk.com> wrote:
> 
> Hi,
> 
> We are using the streaming API.    We are trying to understand what hadoop uses as a threshold or trigger to involve more TaskTracker nodes in a given Map-Reduce execution.
> 
> With default settings (64MB chunk size in HDFS), if the input file is less than 64MB, will the data processing only occur on a single TaskTracker Node, even if our cluster size is greater than 1?
> 
> For example, we are trying to figure out if hadoop is more efficient at processing:
> a) a single input file which is just an index file that refers to a jar archive of 100K or 1M individual small files, where the jar file is passed as the "-archives" argument, or
> b) a single input file containing all the raw data represented by the 100K or 1M small files.
> 
> With (a), our input file is <64MB.   With (b) our input file is very large.
> 
> Thanks for any insight,
> 
> -Michael
> 


Re: cluster involvement trigger

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
The number of mappers initialized depends largely on your input format (specifically on its getSplits method). Since (almost all) input formats available in Hadoop derive from FileInputFormat, you get the familiar notion of one mapper per file block (strictly, one mapper per split).
You say that you have a large number of small files. In general, each of these small files (< 64 MB) will be processed by its own mapper. I would suggest looking at CombineFileInputFormat, which packs many small files into a single split based on data locality for better performance (task initialization time is a significant factor in Hadoop's overall performance).
On the other hand, many small files will hamper your namenode's performance, since file metadata is stored in memory, and will limit the cluster's overall capacity with respect to the number of files.
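
As a rough sketch of how this could be wired into a streaming job, assuming a Hadoop release that ships a concrete subclass such as org.apache.hadoop.mapred.lib.CombineTextInputFormat (older releases would need your own concrete subclass of CombineFileInputFormat, and the split-size property name varies across versions; jar path, input, output, and scripts are placeholders):

    # cap each combined split at 64 MB so one mapper handles a batch of small files
    hadoop jar hadoop-streaming.jar \
        -D mapred.max.split.size=67108864 \
        -inputformat org.apache.hadoop.mapred.lib.CombineTextInputFormat \
        -input /user/us/smallfiles -output /user/us/out \
        -mapper map.sh -reducer reduce.sh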

Amogh


On 2/25/10 11:15 PM, "Michael Kintzer" <mi...@zerk.com> wrote:

Hi,

We are using the streaming API.    We are trying to understand what hadoop uses as a threshold or trigger to involve more TaskTracker nodes in a given Map-Reduce execution.

With default settings (64MB chunk size in HDFS), if the input file is less than 64MB, will the data processing only occur on a single TaskTracker Node, even if our cluster size is greater than 1?

For example, we are trying to figure out if hadoop is more efficient at processing:
a) a single input file which is just an index file that refers to a jar archive of 100K or 1M individual small files, where the jar file is passed as the "-archives" argument, or
b) a single input file containing all the raw data represented by the 100K or 1M small files.

With (a), our input file is <64MB.   With (b) our input file is very large.

Thanks for any insight,

-Michael