Posted to common-user@hadoop.apache.org by Udaya Lakshmi <ud...@gmail.com> on 2010/01/29 04:34:07 UTC

File split query

Hi,
   When the framework splits a file, can part of a line fall in one
split and the rest in another split? Or does the framework make sure
that it always splits at the end of a line?

Thanks,
Udaya.

Re: File split query

Posted by Prabhu Hari Dhanapal <dr...@gmail.com>.
I guess this would be a better answer:


A FileSplit is merely a description of the boundaries, e.g., "bytes 0 to
9999" and "bytes 10000 to 19999". The Mapper then interprets the boundaries
described by a FileSplit in a way that makes sense at the data level. The
FileSplit does not physically contain the data to be mapped over.
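
To make the "description of the boundaries" idea concrete, here is a minimal
Python sketch. It is not Hadoop code: the Split type and compute_splits
function are hypothetical names made up for this illustration, and Hadoop's
actual split computation takes more into account (block size and configured
minimum/maximum split sizes, for instance). The point is only that a split
is an (offset, length) pair computed from the file size, without looking at
the file's contents.

```python
from typing import List, NamedTuple


class Split(NamedTuple):
    """Hypothetical stand-in for a split: boundaries only, no data."""
    start: int   # offset of the first byte in the split
    length: int  # number of bytes the split covers


def compute_splits(file_size: int, split_size: int) -> List[Split]:
    """Cut [0, file_size) into consecutive byte ranges, ignoring content."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [
        Split(start, min(split_size, file_size - start))
        for start in range(0, file_size, split_size)
    ]


# A 25000-byte file cut into 10000-byte ranges; the last one is shorter.
for split in compute_splits(25000, 10000):
    print(split.start, split.start + split.length - 1)
```

Nothing here knows where lines begin or end, which is why a boundary can
land in the middle of a line.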

So mapper 1 will open the file via the InputFormat, start reading at byte
0, and stop reading when it gets to its "final record," which is defined as
the first record that ends after byte 9999. If it has to read through
byte 10020, that's OK. The stream used to read the bytes from the file will
not "cut off" at 9999.

Mapper 2 starts reading at byte 10000. It finds the first newline at byte
10020, so the first "real" record it processes starts at byte 10021.
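
The reading rule described above can be simulated in a few lines of Python.
This is a sketch under stated assumptions, not Hadoop's implementation:
read_split is a hypothetical helper, records are newline-terminated lines,
and the file is held in memory as bytes. One detail is an assumption added
here to make the rule airtight: a reader keeps going while its position is
less than or equal to the split end, so a line that starts exactly on a
boundary is read by the earlier split, whose successor then skips it.

```python
import io
from typing import List


def read_split(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the complete lines that belong to the split [start, start+length)."""
    end = start + length
    stream = io.BytesIO(data)
    stream.seek(start)
    pos = start

    if start != 0:
        # Every split except the first discards everything up to and
        # including the first newline; the previous reader handles it.
        pos += len(stream.readline())

    records = []
    # "<=" rather than "<": a line starting exactly at `end` is read here,
    # because the next split's reader will skip it.
    while pos <= end:
        line = stream.readline()
        if not line:  # end of file
            break
        pos += len(line)
        records.append(line.rstrip(b"\n"))
    return records


data = b"one\ntwo two\nthree\n"
# Three 6-byte splits; the first boundary falls inside "two two".
for start in range(0, len(data), 6):
    print(start, read_split(data, start, min(6, len(data) - start)))
```

Each line ends up in exactly one split, and no reader ever returns a
partial line, even though the boundaries themselves ignore line structure.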


http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3Cd6d7c4410906110012l3629748agf064176b224c8f96@mail.gmail.com%3E




-- 
Hari

Re: File split query

Posted by Prabhu Hari Dhanapal <dr...@gmail.com>.
The splitting does not know anything about the input file's internal logical
structure; for example, line-oriented text files are split on arbitrary byte
boundaries.




-- 
Hari

Re: File split query

Posted by ".ke. sivakumar" <ke...@gmail.com>.
Hadoop will take care of it. If a split would end in the middle of a line,
it is extended to the end of that line, though the split limit is then
exceeded by a few bytes.




Re: File split query

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
In general, a file split may break records; it's the responsibility of the record reader to present each record as a whole. If you use the standard InputFormats, the framework will make sure complete records are presented as <key,value> pairs.

Amogh

