Posted to common-user@hadoop.apache.org by hong <mi...@163.com> on 2008/07/08 15:36:55 UTC

Is map running in parallel?

Hi!

When we run map-reduce on Hadoop, does the map run on a single node
or in parallel on several nodes?

If it runs in parallel, the input file has to be split. How can
Hadoop split the input file at the right position? For example, in
wordcount, the input file cannot be divided in the middle of a word.


Re: Is map running in parallel?

Posted by heyongqiang <he...@software.ict.ac.cn>.
Hi, senior classmate,
The Javadoc of InputFormat may help.
Splitting is exactly the job of the InputFormat classes. Hadoop itself ships several InputFormat implementations.
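
For a concrete picture, here is a minimal driver in the style of the classic WordCount example, written against the old org.apache.hadoop.mapred API that was current at the time. Treat it as a sketch, not the exact shipped example; the comments mark where the InputFormat comes in:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      // key is the byte offset of this line in the file, value is the line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // TextInputFormat.getSplits() cuts the input into byte ranges, roughly
    // one per HDFS block. Each split becomes one map task, so the maps run
    // in parallel across the cluster, and TextInputFormat's record reader
    // takes care of lines that straddle a split boundary.
    conf.setInputFormat(TextInputFormat.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}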




heyongqiang
2008-07-08




Re: Is map running in parallel?

Posted by Andreas Kostyrka <an...@kostyrka.org>.
On Tuesday 08 July 2008 15:36:55 hong wrote:
> Hi!
>
> When we run map-reduce on Hadoop, does the map run on a single node
> or in parallel on several nodes?
>
> If it runs in parallel, the input file has to be split. How can
> Hadoop split the input file at the right position? For example, in
> wordcount, the input file cannot be divided in the middle of a word.

Read the docs, but basically Java map-reduce jobs handle this by
following a clever distributed algorithm.

I've never used it myself, but the following algorithm would work:

Each map task gets a start offset and a length.

If the offset is 0, process the first record. If not, skip to the first 
end-of-record position.

Process records as long as the start of the record is inside the range you 
were assigned.

This way, all records get processed by some mapper.
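
That is roughly what Hadoop's LineRecordReader does for text input. Below is a standalone sketch of the idea in plain Java (hypothetical code, not Hadoop's actual implementation). One refinement over the rule as stated above: back up one byte before skipping, so a record that starts exactly on the split boundary is not lost.

import java.io.IOException;
import java.io.RandomAccessFile;

public class SplitReader {

  /** Processes every line whose first byte lies in [start, start + length). */
  static void processSplit(String file, long start, long length)
      throws IOException {
    RandomAccessFile in = new RandomAccessFile(file, "r");
    try {
      if (start != 0) {
        // We probably landed mid-record: back up one byte and skip to the
        // next '\n'. Backing up means that if we landed exactly on a record
        // start, the skip only consumes the previous record's newline and
        // this record is still ours.
        in.seek(start - 1);
        int b;
        do {
          b = in.read();
        } while (b != -1 && b != '\n');
      }

      // Process records while the record *starts* inside our range. The
      // last record may run past start + length; that is fine, because the
      // next split will skip it (rule above).
      while (in.getFilePointer() < start + length) {
        String line = in.readLine();
        if (line == null) {
          break; // end of file
        }
        System.out.println(line); // stand-in for the real map() call
      }
    } finally {
      in.close();
    }
  }
}

With this rule, the mapper whose byte range contains a record's first byte owns the whole record, so every record is processed exactly once even though no mapper ever sees the whole file.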

Andreas