Posted to common-user@hadoop.apache.org by Stuart White <st...@gmail.com> on 2009/05/28 14:15:56 UTC

InputFormat for fixed-width records?

I need to process a dataset that contains text records of fixed length
in bytes.  For example, each record may be 100 bytes in length, with
the first field being the first 10 bytes, the second field being the
second 10 bytes, etc...  There are no newlines in the file.  Field
values have been either whitespace-padded or truncated to fit within
the specific locations in these fixed-width records.

Does Hadoop have an InputFormat to support processing of such files?
I looked but couldn't find one.

Of course, I could pre-process the file (outside of Hadoop) to put
newlines at the end of each record, but I'd prefer not to require such
a prep step.

Thanks.

Re: InputFormat for fixed-width records?

Posted by Tom White <to...@cloudera.com>.
Hi Stuart,

There isn't an InputFormat that comes with Hadoop to do this. Rather
than pre-processing the file, it would be better to implement your own
InputFormat. Subclass FileInputFormat and provide an implementation of
getRecordReader() that returns your implementation of RecordReader to
read fixed-width records. In the next() method you would do something
like:

byte[] buf = new byte[100];
IOUtils.readFully(in, buf, 0, 100); // the offset argument indexes into buf, not the stream
pos += 100; // track the file position ourselves

You would also need to check for the end of the stream. See
LineRecordReader for some ideas. You'll also have to handle finding
the start of records for a split, which you can do by looking at the
offset and seeking to the next multiple of 100.

If the RecordReader were a RecordReader<NullWritable, BytesWritable>
(no keys), it would return each record as a byte array to the
mapper, which would then break it into fields. Alternatively, you
could do it in the RecordReader, and use your own type which
encapsulates the fields for the value.
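
To make this concrete, here is a minimal sketch of such an InputFormat,
written against the old org.apache.hadoop.mapred API that
getRecordReader() belongs to. The class names are made up, the record
length is hard-coded to 100 bytes, and the file length is assumed to be
an exact multiple of the record length:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical names; not part of Hadoop.
public class FixedWidthInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  private static final int RECORD_LEN = 100;

  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new FixedWidthRecordReader((FileSplit) split, job, RECORD_LEN);
  }

  static class FixedWidthRecordReader
      implements RecordReader<NullWritable, BytesWritable> {

    private final FSDataInputStream in;
    private final int recordLen;
    private final long start;  // first record boundary at or after the split
    private final long end;    // first byte past the split
    private long pos;

    FixedWidthRecordReader(FileSplit split, JobConf job, int recordLen)
        throws IOException {
      this.recordLen = recordLen;
      // Seek to the next multiple of the record length, so a record that
      // straddles the previous split boundary is read by that split's
      // reader, not this one.
      start = (split.getStart() + recordLen - 1) / recordLen * recordLen;
      end = split.getStart() + split.getLength();
      pos = start;
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(job);
      in = fs.open(file);
      in.seek(start);
    }

    public boolean next(NullWritable key, BytesWritable value)
        throws IOException {
      // A record belongs to this split if it starts before the split's
      // end; the stream reads straight across the boundary to finish it.
      if (pos >= end) {
        return false;
      }
      byte[] buf = new byte[recordLen];
      IOUtils.readFully(in, buf, 0, recordLen);
      value.set(buf, 0, recordLen);
      pos += recordLen;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return pos; }
    public float getProgress() {
      return end == start ? 0.0f
          : Math.min(1.0f, (pos - start) / (float) (end - start));
    }
    public void close() throws IOException { in.close(); }
  }
}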

Hope this helps.

Cheers,
Tom

Re: InputFormat for fixed-width records?

Posted by Yabo-Arber Xu <ar...@gmail.com>.
Thanks for your reply. It clarifies a lot. The place I was not so sure
about was how to read the last record in a split, but now it seems there
is no problem, as the filesystem handles it for me. :-)

Re: InputFormat for fixed-width records?

Posted by Chuck Lam <ch...@gmail.com>.
Yes, it's totally possible for part of one record to be in the first file
split and the rest in the second. It's the job of the RecordReader to
make sure it's always reading entire records. Given a file split, your
RecordReader has to be able to skip over the first few bytes to get to the
first full record (if there's a partial record at the beginning). When it
reaches the end of the split, if there's a partial record there, it will go
get the rest of the record from the next split.

Tom's email earlier in this thread explained some of the details. Like he
said, look at LineRecordReader for inspiration. The logic for figuring out
the start of the first full record is in LineRecordReader itself. The
RecordReader can read the last record (that spans two file splits) without
any special logic because the Hadoop filesystem abstracts away file split
boundaries when reading.
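
In code, those two boundary rules come down to a couple of lines of
arithmetic. A hypothetical helper, assuming 100-byte records and the
old-API FileSplit:

import org.apache.hadoop.mapred.FileSplit;

class SplitBoundaries {
  static final int RECORD_LEN = 100;

  // Skip any partial record at the head of the split: round the split's
  // start offset up to the next record boundary.
  static long firstRecordStart(FileSplit split) {
    return (split.getStart() + RECORD_LEN - 1) / RECORD_LEN * RECORD_LEN;
  }

  // A record belongs to the split that contains its first byte; the last
  // such record may spill into the next split, and the filesystem stream
  // simply keeps reading across that boundary.
  static boolean startsInSplit(long recordStart, FileSplit split) {
    return recordStart >= firstRecordStart(split)
        && recordStart < split.getStart() + split.getLength();
  }
}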

Re: InputFormat for fixed-width records?

Posted by Yabo-Arber Xu <ar...@gmail.com>.
I have a follow-up question on this thread: how do we make sure that, at
the getFileSplit phase, there are no records that cross the boundary
between different file splits?

To explain my point better: if each of my records is 100 bytes, could
there be a record whose key is placed in the first file split while its
value is placed in the second?

Best,
Arber

Re: InputFormat for fixed-width records?

Posted by Stuart White <st...@gmail.com>.
On Thu, May 28, 2009 at 9:50 AM, Owen O'Malley <om...@apache.org> wrote:

>
> The update to the terasort example has an InputFormat that does exactly
> that. The key is 10 bytes and the value is the next 90 bytes. It is pretty
> easy to write, but I should upload it soon. The output types are Text, but
> they just have the binary data in them.
>

Would you mind uploading it or sending it to the list?

Re: InputFormat for fixed-width records?

Posted by Owen O'Malley <om...@apache.org>.
On May 28, 2009, at 5:15 AM, Stuart White wrote:

> I need to process a dataset that contains text records of fixed length
> in bytes.  For example, each record may be 100 bytes in length

The update to the terasort example has an InputFormat that does exactly
that. The key is 10 bytes and the value is the next 90 bytes. It is
pretty easy to write, but I should upload it soon. The output types are
Text, but they just have the binary data in them.
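
Roughly, the carve-up would look like this inside a reader's next()
method, given org.apache.hadoop.io.Text key/value objects (a
hypothetical sketch; the real terasort code may differ):

// Hypothetical sketch; buf holds one complete 100-byte record.
static void carve(byte[] buf, Text key, Text value) {
  key.set(buf, 0, 10);    // the first 10 bytes become the key
  value.set(buf, 10, 90); // the remaining 90 bytes become the value
  // Text.set(byte[], int, int) copies the raw bytes without UTF-8
  // validation, which lets binary data ride along in a Text.
}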

-- Owen