Posted to common-user@hadoop.apache.org by Lei Chen <lc...@gmail.com> on 2006/04/20 08:20:50 UTC
How does a big file get divided?
Hi,
I am a new user of Hadoop. This project looks cool.
I have a question about MapReduce. I want to process a big
file. To my understanding, Hadoop partitions a big file into blocks, and
each block is assigned to a worker. How does Hadoop decide where to
cut those big files? Does it guarantee that each line of the input file is
assigned to a single block, and that no line is split between two
blocks?
Lei
Re: How does a big file get divided?
Posted by Teppo Kurki <tj...@iki.fi>.
Lei Chen wrote:
> It seems that a big file can be split within one line, but map/reduce
> will still work properly, since the DFS layer hides the block layout
> information from the map/reduce tasks.
It's up to the InputFormat to handle records that are split across FileSplit
boundaries.
TextInputFormat apparently reads past the end of the split boundary to
finish its last line, while a reader whose split starts mid-file skips
ahead and begins at the first linebreak it encounters. See
http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup
for details.
(I added this info to http://wiki.apache.org/lucene-hadoop/HadoopMapReduce).
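The rule Teppo describes can be sketched outside Hadoop. Below is a minimal, self-contained simulation (not actual Hadoop code; the function name, data, and split size are made up for illustration) of the TextInputFormat-style convention: a line belongs to the split containing its first byte, so a reader skips a leading partial line and reads past its own end boundary to finish its last line.

```python
def read_lines_from_split(data: bytes, start: int, end: int):
    """Yield the complete lines 'owned' by the byte range [start, end).

    A line belongs to the split that contains its first byte. A reader
    whose split begins mid-line skips ahead to the next newline (the
    previous split's reader emits that line); a reader whose split ends
    mid-line keeps reading past `end` to finish the line.
    """
    pos = start
    if start != 0:
        # Skip the partial line at the front of this split.
        nl = data.find(b"\n", start)
        if nl == -1:
            return
        pos = nl + 1
    while pos < end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            yield data[pos:]  # final line with no trailing newline
            return
        yield data[pos:nl]
        pos = nl + 1

data = b"alpha\nbravo\ncharlie\ndelta\necho\n"
split_size = 8  # deliberately misaligned with line boundaries
lines = []
for start in range(0, len(data), split_size):
    end = min(start + split_size, len(data))
    lines.extend(read_lines_from_split(data, start, end))

print(lines)
# → [b'alpha', b'bravo', b'charlie', b'delta', b'echo']
```

Even though every split boundary here falls inside a line, each line is emitted exactly once, which is why splitting a file mid-line does not break a line-oriented map/reduce job.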
Re: How does a big file get divided?
Posted by Lei Chen <lc...@gmail.com>.
Thanks, Arbow
I checked the code and also carried out some experiments. It seems that a big
file can be split within one line, but map/reduce will still work
properly, since the DFS layer hides the block layout information from the
map/reduce tasks.
Lei
On 4/20/06, Arbow <av...@gmail.com> wrote:
>
> Hi, Lei Chen:
>
> You can take a look at org.apache.hadoop.mapred.InputFormatBase; I
> think it will help you.
Re: How does a big file get divided?
Posted by Arbow <av...@gmail.com>.
Hi, Lei Chen:
You can take a look at org.apache.hadoop.mapred.InputFormatBase; I
think it will help you.