Posted to common-user@hadoop.apache.org by Lei Chen <lc...@gmail.com> on 2006/04/20 08:20:50 UTC

How does a big file get divided

Hi,
     I am a new user of Hadoop. This project looks cool.

     I have a question about MapReduce. I want to process a big
file. To my understanding, Hadoop partitions a big file into blocks, and
each block is assigned to a worker. Then, how does Hadoop decide where to
cut those big files? Does it guarantee that each line in the input file will
be assigned to one block, and that no line will be divided into two parts in
different blocks?

Lei

Re: How does a big file get divided

Posted by Teppo Kurki <tj...@iki.fi>.
Lei Chen wrote:

>It seems that a big
>file can be split within one line. But map/reduce will still work
>properly, since the DFS layer hides the block layout information from the
>map/reduce tasks.

It's up to the InputFormat to handle records that are split on FileSplit 
boundaries.

TextInputFormat apparently reads one line past the end of the split 
boundary to finish the record it started, and a reader whose split does 
not begin at the start of the file skips ahead to the first line break 
before emitting records, since the partial line in front of it belongs 
to the previous split. See 
http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup 
for details.
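
To make that rule concrete, here is a minimal, self-contained sketch of 
the same logic in plain java.io rather than the Hadoop API (the class and 
method names are made up for illustration; it assumes ASCII text and '\n' 
line endings):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Sketch of the split-boundary rule: a reader whose split does not start
// at byte 0 discards everything up to and including the first '\n' (that
// leading fragment belongs to the previous split), and every reader keeps
// going until it has emitted the line that starts at or before its split
// end, even if that line extends past the boundary.
public class SplitLineReader {

    public static List<String> readSplit(String file, long start, long length)
            throws IOException {
        List<String> lines = new ArrayList<String>();
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        try {
            long end = start + length;
            raf.seek(start);
            if (start != 0) {
                skipPastNewline(raf); // leading fragment -> previous split
            }
            // Read while the line STARTS at or before 'end'; the last line
            // may extend past the split boundary, which is exactly the point.
            while (raf.getFilePointer() <= end) {
                String line = raf.readLine();
                if (line == null) {
                    break; // end of file
                }
                lines.add(line);
            }
        } finally {
            raf.close();
        }
        return lines;
    }

    private static void skipPastNewline(RandomAccessFile raf) throws IOException {
        int b;
        while ((b = raf.read()) != -1 && b != '\n') {
            // keep skipping until just past the first newline
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo: split a three-line file in the middle of "bravo" and show
        // that the two splits together yield every line exactly once.
        java.io.File f = java.io.File.createTempFile("split", ".txt");
        java.io.FileWriter w = new java.io.FileWriter(f);
        w.write("alpha\nbravo\ncharlie\n");
        w.close();
        long mid = 8; // byte 8 falls inside "bravo"
        System.out.println(readSplit(f.getPath(), 0, mid));                // [alpha, bravo]
        System.out.println(readSplit(f.getPath(), mid, f.length() - mid)); // [charlie]
    }
}

The two halves of the rule are complementary, so a line that straddles (or 
starts exactly on) a boundary is read in full by exactly one split.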

(I added this info to http://wiki.apache.org/lucene-hadoop/HadoopMapReduce).

Re: How does a big file get divided

Posted by Lei Chen <lc...@gmail.com>.
Thanks, Arbow

I checked the code and also carried out some experiments. It seems that a big
file can be split within one line. But map/reduce will still work
properly, since the DFS layer hides the block layout information from the
map/reduce tasks.
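
In other words, a DFS client sees a file as one continuous byte stream; a
read that straddles a block boundary is stitched together under the covers.
A rough sketch against the FileSystem client API (the path, offsets, and
block size are made up; this is an illustration, not code from Hadoop itself):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A read that crosses a block boundary looks no different to the caller
// than any other read; the client library fetches the bytes from whichever
// datanodes hold the blocks involved.
public class ReadAcrossBlocks {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/data/big.txt")); // made-up path
        try {
            long blockSize = 64L * 1024 * 1024;  // assumed DFS block size
            in.seek(blockSize - 10);             // 10 bytes before the first block ends
            byte[] buf = new byte[20];           // this span crosses the boundary
            in.readFully(buf);                   // one contiguous read; no seam visible
            System.out.println(new String(buf, "US-ASCII"));
        } finally {
            in.close();
        }
    }
}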

Lei

On 4/20/06, Arbow <av...@gmail.com> wrote:
>
> Hi, Lei Chen:
>
> You can take a look at org.apache.hadoop.mapred.InputFormatBase; I
> think it will help you.
>
> On 4/20/06, Lei Chen <lc...@gmail.com> wrote:
> > Hi,
> >      I am a new user of Hadoop. This project looks cool.
> >
> >      I have a question about MapReduce. I want to process a big
> > file. To my understanding, Hadoop partitions a big file into blocks,
> > and each block is assigned to a worker. Then, how does Hadoop decide
> > where to cut those big files? Does it guarantee that each line in the
> > input file will be assigned to one block, and that no line will be
> > divided into two parts in different blocks?
> >
> > Lei
> >
> >
>

Re: How does a big file get divided

Posted by Arbow <av...@gmail.com>.
Hi, Lei Chen:

  You can take a look at org.apache.hadoop.mapred.InputFormatBase; I
think it will help you.
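
Roughly speaking, the default split logic there just carves the file into
byte ranges of about the DFS block size and pays no attention to line or
record boundaries. A simplified sketch of the idea (not the actual
InputFormatBase source; the 64 MB figure is only an assumed block size):

import java.util.ArrayList;
import java.util.List;

// Carve a file of fileLen bytes into {offset, length} ranges of at most
// splitSize bytes each. Record boundaries are never consulted, which is
// why a line can start in one split and end in the next.
public class NaiveSplitter {

    public static List<long[]> getSplits(long fileLen, long splitSize) {
        List<long[]> splits = new ArrayList<long[]>();
        for (long off = 0; off < fileLen; off += splitSize) {
            splits.add(new long[] { off, Math.min(splitSize, fileLen - off) });
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed block size
        for (long[] s : getSplits(200L * 1024 * 1024, blockSize)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}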

On 4/20/06, Lei Chen <lc...@gmail.com> wrote:
> Hi,
>      I am a new user of Hadoop. This project looks cool.
>
>      I have a question about MapReduce. I want to process a big
> file. To my understanding, Hadoop partitions a big file into blocks, and
> each block is assigned to a worker. Then, how does Hadoop decide where to
> cut those big files? Does it guarantee that each line in the input file will
> be assigned to one block, and that no line will be divided into two parts in
> different blocks?
>
> Lei
>
>