Posted to common-user@hadoop.apache.org by Xuan Dzung Doan <do...@yahoo.com> on 2008/06/25 05:02:55 UTC

How Mappers function and solution for my input file problem?

Hi,

I'm a Hadoop newbie. My question is as follows:

The level of parallelism of a job, with respect to mappers, is largely the number of map tasks spawned, which is equal to the number of InputSplits. But within each InputSplit there may be many records (many input key-value pairs), each of which is processed by a separate call to the map() method. So are these calls within one single map task also executed in parallel by the framework?

Now I'm going to write a small map/reduce program that handles a small input data file (a few tens of MB). The data file is a text file that contains many variable-length string sequences delimited by the star (*) character. The program will extract the string sequences from the file and handle each one individually, in parallel.

It looks like there are two ways to handle the input file:

1) Don't have it split by the framework. In other words, there will be only one InputSplit, which is the entire file, and one map task. The RecordReader (?) will be responsible for extracting the individual string sequences and feeding them to the map() method calls (a rough sketch of what I have in mind follows after option 2). But if the answer to my previous question is no (the calls are not processed in parallel), it's pointless to write this program as a map/reduce one.

2) Don't feed the data file to the job as an input file; instead, cache it in the DistributedCache. Spawn as many map tasks as needed (optimally, probably as many as there are sequences). These tasks will have no input file, but will instead take their input data from the file in the cache (there must be some mechanism to make sure each task handles a different sequence).
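
Here is the rough, untested sketch I have in mind for option 1, against the
org.apache.hadoop.mapred API. The class names are made up, and it assumes
single-byte characters and no back-to-back '*' delimiters:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// One InputSplit == the whole file; each map() call gets one '*'-delimited
// sequence.  All calls happen inside the single map task.
public class StarDelimitedInputFormat extends FileInputFormat<LongWritable, Text> {

  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;                       // option 1: never split the file
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new StarRecordReader((FileSplit) split, job);
  }

  // Reads one '*'-terminated sequence per call to next().
  static class StarRecordReader implements RecordReader<LongWritable, Text> {
    private final FSDataInputStream in;
    private final long start, end;
    private long pos;

    StarRecordReader(FileSplit split, JobConf job) throws IOException {
      Path file = split.getPath();
      in = file.getFileSystem(job).open(file);
      start = split.getStart();         // 0 here, since the file is not split
      end = start + split.getLength();
      pos = start;
      in.seek(start);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      if (pos >= end) return false;
      key.set(pos);                     // key = byte offset of the sequence
      StringBuilder seq = new StringBuilder();
      int b;
      while (pos < end && (b = in.read()) != -1) {
        pos++;
        if (b == '*') break;            // '*' terminates one sequence
        seq.append((char) b);           // assumes single-byte characters
      }
      value.set(seq.toString());
      return seq.length() > 0;          // assumes no empty sequences
    }

    public LongWritable createKey()  { return new LongWritable(); }
    public Text createValue()        { return new Text(); }
    public long getPos()             { return pos; }
    public float getProgress() {
      return end == start ? 1.0f : (pos - start) / (float) (end - start);
    }
    public void close() throws IOException { in.close(); }
  }
}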

I'd highly appreciate it if anyone could answer my question and/or comment on these two ways of handling the input file, or suggest other ways of doing it.

Thanks,
David.



      

Re: How Mappers function and solution for my input file problem?

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jun 24, 2008 at 10:31 PM, Amar Kamat <am...@yahoo-inc.com> wrote:

> Xuan Dzung Doan wrote:
>
>>
>>
>> The level of parallelism of a job, with respect to mappers, is largely the
>> number of map tasks spawned, which is equal to the number of InputSplits.
>> But within each InputSplit, there may be many records (many input key-value
>> pairs), each is processed by one separate call to the map() method. So are
>> these calls within one single map task also executed in parallel by the
>> framework?
>>
>>
>>
> Afaik no.


This might be a bit misunderstood.

Each task node does run a few map tasks at a time, and each of these could
be considered a "single map task executed in parallel".

It is definitely true that you have more than one map task, even per task
node. But it is also true that you get many calls to map() per map task.
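
For concreteness, these are the knobs involved (property name as of the
0.1x-era configuration; the numbers are only illustrative):

import org.apache.hadoop.mapred.JobConf;

public class ParallelismKnobs {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // Hint for how many map tasks (InputSplits) the job gets in total;
    // the framework may adjust it.  Calls to map() within one task run
    // one after another, not in parallel.
    conf.setNumMapTasks(10);

    // Per-node concurrency is a tasktracker setting (normally placed in
    // the tasktracker's hadoop-site.xml, not in the job configuration):
    // <property>
    //   <name>mapred.tasktracker.map.tasks.maximum</name>
    //   <value>2</value>
    // </property>
  }
}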

Re: How Mappers function and solution for my input file problem?

Posted by Amar Kamat <am...@yahoo-inc.com>.
Xuan Dzung Doan wrote:
> Hi,
>
> I'm a Hadoop newbie. My question is as follows:
>
> The level of parallelism of a job, with respect to mappers, is largely the number of map tasks spawned, which is equal to the number of InputSplits. But within each InputSplit, there may be many records (many input key-value pairs), each is processed by one separate call to the map() method. So are these calls within one single map task also executed in parallel by the framework?
>
>   
Afaik no.
> Now I'm going to write a small map/reduce program that handles a small input data file (about a few tens of Mbs). The data file is a text file that contains many variable-length string sequences delimited by the star (*) character. The program will extract the string sequences from the file  and handle each individually in parallel.
>
> Looks like there may be 2 ways to handle the input file:
>
> 1) Not have it split by the framework. In other words, there will be only one InputSplit which is the entire file and one map task. The RecordReader (?) will be responsible for extracting individual string sequences and feeding them to the map() method calls. But if the answer to my previous question is No (the calls are not processed in parallel), it's pointless to write this program as a map/reduce one.
>
> (2) Not feed the data file as an input file to the job, but instead cache it in the DistributedCache. Spawn as many map tasks as needed (probably optimally as many as the number of sequences). These tasks will not have input file, but instead take the input data from the file in the cache (there must be some mechanism to make sure each task handles a different sequence).
>   
I would prefer writing my own InputFormat to do the splitting, and giving
the input file a very high replication factor. That way you can easily
increase the file size later without any additional changes.
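
Roughly (untested; this assumes the '*'-delimited InputFormat sketched
earlier in the thread is changed to allow splitting, with each reader
skipping forward to the first '*' after its split's start, the same way
LineRecordReader does for newlines), the driver side is then just
configuration:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SequenceJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(SequenceJobDriver.class);
    Path input = new Path(args[0]);

    // Hypothetical splittable variant of the '*'-delimited InputFormat.
    job.setInputFormat(StarDelimitedInputFormat.class);
    FileInputFormat.setInputPaths(job, input);

    // Ask for more splits than the one-per-block default (a hint only).
    job.setNumMapTasks(20);

    // Replicate the small input widely so most nodes read a local copy.
    FileSystem fs = input.getFileSystem(job);
    fs.setReplication(input, (short) 10);

    // ... set mapper/reducer/output classes here, then JobClient.runJob(job).
  }
}
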
> I'd highly appreciate if anyone could answer my question and/or comment on these 2 ways of handling the input file or suggest other ways of doing that.
>
> Thanks,
> David.
>
>
>
>       
>