Posted to common-user@hadoop.apache.org by Yabo-Arber Xu <ar...@gmail.com> on 2009/06/02 05:05:55 UTC

Re: InputFormat for fixed-width records?

I have a follow-up question on this thread: how do we make sure that, at the
getFileSplit phase, no records cross the boundary between different file
splits?

To illustrate: if each of my records is 100 bytes, could there be a case
where a record's key is placed in the first file split while its value ends
up in the second split?

Best,
Arber

On Thu, May 28, 2009 at 10:50 PM, Owen O'Malley <om...@apache.org> wrote:

> On May 28, 2009, at 5:15 AM, Stuart White wrote:
>
>> I need to process a dataset that contains text records of fixed length
>> in bytes. For example, each record may be 100 bytes in length.
>>
>
> The update to the terasort example has an InputFormat that does exactly
> that. The key is 10 bytes and the value is the next 90 bytes. It is pretty
> easy to write, but I should upload it soon. The output types are Text, but
> they just have the binary data in them.
>
> -- Owen
>
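Owen's layout (a 10-byte key followed by a 90-byte value in each 100-byte
record) can be sketched without any Hadoop code. This is an illustrative
sketch, not the actual terasort InputFormat; the class and method names are
made up for the example:

```java
import java.util.Arrays;

// Hypothetical sketch of the fixed-width layout described above:
// each 100-byte record is a 10-byte key followed by a 90-byte value.
public class FixedWidthRecord {
    static final int KEY_LEN = 10;
    static final int RECORD_LEN = 100;

    // Returns {key, value} for one record's worth of bytes.
    static byte[][] split(byte[] record) {
        if (record.length != RECORD_LEN) {
            throw new IllegalArgumentException(
                "expected " + RECORD_LEN + " bytes, got " + record.length);
        }
        byte[] key = Arrays.copyOfRange(record, 0, KEY_LEN);
        byte[] value = Arrays.copyOfRange(record, KEY_LEN, RECORD_LEN);
        return new byte[][] { key, value };
    }

    public static void main(String[] args) {
        byte[] record = new byte[RECORD_LEN];
        for (int i = 0; i < record.length; i++) record[i] = (byte) i;
        byte[][] kv = split(record);
        System.out.println(kv[0].length + "/" + kv[1].length); // 10/90
    }
}
```

In the real example Owen mentions, both pieces end up in Text objects even
though the bytes they carry are binary.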

Re: InputFormat for fixed-width records?

Posted by Yabo-Arber Xu <ar...@gmail.com>.
Thanks for your reply. It clarifies a lot. The part I was not so sure about
was how to read the last record in a split, but it seems there is no problem,
as the filesystem handles that for me. :-)

On Tue, Jun 2, 2009 at 12:40 PM, Chuck Lam <ch...@gmail.com> wrote:

> Yes, it's totally possible for part of one record to be in the first file
> split and the rest in the second. It's the job of the RecordReader to
> make sure it always reads entire records. Given a file split, your
> RecordReader has to be able to skip over the first few bytes to get to the
> first full record (if there's a partial record at the beginning). When it
> reaches the end of the split, if there's a partial record there, it will go
> get the rest of the record from the next split.
>
> Tom's email earlier in this thread explained some of the details. Like he
> said, look at LineRecordReader for inspiration. The logic for figuring out
> the start of the first full record is in LineRecordReader itself. The
> RecordReader can read the last record (that spans two file splits) without
> any special logic because the Hadoop filesystem abstracts away file split
> boundaries when reading.
>
>
>
> On Mon, Jun 1, 2009 at 8:05 PM, Yabo-Arber Xu <arber.research@gmail.com>
> wrote:
>
> > I have a follow-up question on this thread: how do we make sure that,
> > at the getFileSplit phase, no records cross the boundary between
> > different file splits?
> >
> > To illustrate: if each of my records is 100 bytes, could there be a case
> > where a record's key is placed in the first file split while its value
> > ends up in the second split?
> >
> > Best,
> > Arber
> >
> > On Thu, May 28, 2009 at 10:50 PM, Owen O'Malley <om...@apache.org> wrote:
> >
> > > On May 28, 2009, at 5:15 AM, Stuart White wrote:
> > >
> > >> I need to process a dataset that contains text records of fixed length
> > >> in bytes. For example, each record may be 100 bytes in length.
> > >>
> > >
> > > The update to the terasort example has an InputFormat that does exactly
> > > that. The key is 10 bytes and the value is the next 90 bytes. It is
> > > pretty easy to write, but I should upload it soon. The output types are
> > > Text, but they just have the binary data in them.
> > >
> > > -- Owen
> > >
> >
>

Re: InputFormat for fixed-width records?

Posted by Chuck Lam <ch...@gmail.com>.
Yes, it's totally possible for part of one record to be in the first file
split and the rest in the second. It's the job of the RecordReader to
make sure it always reads entire records. Given a file split, your
RecordReader has to be able to skip over the first few bytes to get to the
first full record (if there's a partial record at the beginning). When it
reaches the end of the split, if there's a partial record there, it will go
get the rest of the record from the next split.

Tom's email earlier in this thread explained some of the details. Like he
said, look at LineRecordReader for inspiration. The logic for figuring out
the start of the first full record is in LineRecordReader itself. The
RecordReader can read the last record (that spans two file splits) without
any special logic because the Hadoop filesystem abstracts away file split
boundaries when reading.
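The offset arithmetic behind that description can be sketched in plain Java.
This is an illustrative sketch, not Hadoop API: a reader owns every record
that begins inside its split, skips any partial record at the split's start,
and reads past the split's end to finish its last record.

```java
// Illustrative offset arithmetic for a fixed-width RecordReader: the
// reader owns every record that *starts* inside its split.
public class FixedWidthBounds {
    // Offset of the first record that begins at or after splitStart.
    static long firstRecordStart(long splitStart, int recordLen) {
        long rem = splitStart % recordLen;
        return rem == 0 ? splitStart : splitStart + (recordLen - rem);
    }

    // How many records this split's reader is responsible for.
    static long recordsOwned(long splitStart, long splitLen, int recordLen) {
        long splitEnd = splitStart + splitLen;
        long first = firstRecordStart(splitStart, recordLen);
        if (first >= splitEnd) return 0;
        return (splitEnd - first + recordLen - 1) / recordLen;
    }

    // Offset one past the last byte the reader consumes. This may lie
    // beyond the split end, which is fine: the filesystem hides split
    // boundaries, so the reader simply keeps reading into the next
    // split's bytes to finish its final record.
    static long readEnd(long splitStart, long splitLen, int recordLen) {
        return firstRecordStart(splitStart, recordLen)
                + recordsOwned(splitStart, splitLen, recordLen) * recordLen;
    }

    public static void main(String[] args) {
        // 100-byte records; a split covering bytes [250, 480).
        System.out.println(firstRecordStart(250, 100)); // 300: skips a partial record
        System.out.println(recordsOwned(250, 230, 100)); // 2: records at 300 and 400
        System.out.println(readEnd(250, 230, 100));      // 500: 20 bytes past split end
    }
}
```

With these rules, every record is read exactly once across all splits, even
though no record is ever cut in two by a reader.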



On Mon, Jun 1, 2009 at 8:05 PM, Yabo-Arber Xu <ar...@gmail.com> wrote:

> I have a follow-up question on this thread: how do we make sure that, at
> the getFileSplit phase, no records cross the boundary between different
> file splits?
>
> To illustrate: if each of my records is 100 bytes, could there be a case
> where a record's key is placed in the first file split while its value
> ends up in the second split?
>
> Best,
> Arber
>
> On Thu, May 28, 2009 at 10:50 PM, Owen O'Malley <om...@apache.org> wrote:
>
> > On May 28, 2009, at 5:15 AM, Stuart White wrote:
> >
> >> I need to process a dataset that contains text records of fixed length
> >> in bytes. For example, each record may be 100 bytes in length.
> >>
> >
> > The update to the terasort example has an InputFormat that does exactly
> > that. The key is 10 bytes and the value is the next 90 bytes. It is
> > pretty easy to write, but I should upload it soon. The output types are
> > Text, but they just have the binary data in them.
> >
> > -- Owen
> >
>